K-Means Presentation

A Deep Dive Into K-Means Clustering

Finding meaningful patterns within data has become increasingly difficult as data collection and management continue to grow at an unprecedented rate

  • K-means clustering handles highly varied, unlabeled data
  • Finds groups that are not explicitly labeled
  • May surface unknown or hidden business insights

Key Concepts

By the end of this presentation we will have discussed the following concepts of k-means clustering:

  • Unsupervised learning
  • Clustering
  • Calculating distance, specifically using Euclidean Distance
  • Example interpretation of results

The Data

  • “Online Retail” from UC Irvine’s Machine Learning Repository
  • Online retailer based in UK
  • 541,909 observations of customer transactions
  • 8 variables: \(InvoiceNo\), \(StockCode\), \(Description\), \(Quantity\), \(InvoiceDate\), \(UnitPrice\), \(CustomerID\), \(Country\)

Preparing the Data

# A tibble: 10 x 8
   InvoiceNo StockCode Description     Quantity InvoiceDate UnitPrice CustomerID
   <chr>     <chr>     <chr>              <dbl> <chr>           <dbl>      <dbl>
 1 536365    85123A    WHITE HANGING ~        6 12/1/2010 ~      2.55      17850
 2 536365    71053     WHITE METAL LA~        6 12/1/2010 ~      3.39      17850
 3 536365    84406B    CREAM CUPID HE~        8 12/1/2010 ~      2.75      17850
 4 536365    84029G    KNITTED UNION ~        6 12/1/2010 ~      3.39      17850
 5 536365    84029E    RED WOOLLY HOT~        6 12/1/2010 ~      3.39      17850
 6 536365    22752     SET 7 BABUSHKA~        2 12/1/2010 ~      7.65      17850
 7 536365    21730     GLASS STAR FRO~        6 12/1/2010 ~      4.25      17850
 8 536366    22633     HAND WARMER UN~        6 12/1/2010 ~      1.85      17850
 9 536366    22632     HAND WARMER RE~        6 12/1/2010 ~      1.85      17850
10 536367    84879     ASSORTED COLOU~       32 12/1/2010 ~      1.69      13047
# ℹ 1 more variable: Country <chr>

Regional Distribution of Data

Cleaned Data

  • Created variables \(Sales\), \(Orders\), and \(AvgSale\) to capture customer spending habits
  • Subset the data to exclude UK sales
  • Removed null values and quantities \(< 0\)
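As a sketch of how these derived variables might be built with dplyr (the toy transactions below are made-up stand-ins for the real data; \(Quantity\), \(UnitPrice\), and \(InvoiceNo\) are the dataset's actual columns):

```{r}
library(dplyr)

# Made-up transactions standing in for the cleaned retail data
retail <- tibble::tibble(
  CustomerID = c(12347, 12347, 12348, 12349),
  InvoiceNo  = c("A1", "A2", "B1", "C1"),
  Quantity   = c(6, 4, 10, 2),
  UnitPrice  = c(2.5, 3.0, 1.8, 5.0)
)

customer_summary <- retail %>%
  mutate(LineTotal = Quantity * UnitPrice) %>%
  group_by(CustomerID) %>%
  summarise(
    Sales   = sum(LineTotal),         # total spend per customer
    Orders  = n_distinct(InvoiceNo),  # number of distinct invoices
    AvgSale = Sales / Orders          # average spend per order
  )
```

One row per customer comes out, matching the cleaned-data table below in shape.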
# A tibble: 6 x 4
  CustomerID Sales Orders AvgSale
       <dbl> <dbl>  <int>   <dbl>
1      12347 4310       7    616.
2      12348 1797.      4    449.
3      12349 1758.      1   1758.
4      12350  334.      1    334.
5      12352 2506.      8    313.
6      12353   89       1     89 

Visualizing Customer Interaction by Total Sales and Orders

Figure 1: Top 5 Sales

Figure 2: Bottom 5 Sales

Visualizing the Average Customer Interaction

Figure 3: Top 5 Avg Sales

Figure 4: Bottom 5 Avg Sales

Introducing K-Means Clustering By Steps

There are 5 main steps to execute the k-means clustering method:

  • Set \(k\), the desired number of clusters, and choose \(k\) initial centroids
  • Assign each data point to a cluster by its proximity to the nearest \(centroid\)
  • Compute each new \(centroid\) as the mean of its cluster's points
  • Reassign each data point to the cluster whose new centroid is closest
  • Establish the final cluster members by looping the reassignment and centroid-update steps until the centroids no longer change
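The steps above can be sketched in base R; the 2-D points and seed centroids below are made-up for illustration, not the retail data (the built-in `kmeans()` used later does all of this, plus handling of edge cases such as empty clusters):

```{r}
# Toy 2-D data points and k = 2 seed centroids (made-up values)
pts <- matrix(c(1, 1,  1.5, 2,  8, 8,  9, 9), ncol = 2, byrow = TRUE)
centroids <- pts[c(1, 3), ]

for (iter in 1:10) {
  # Assignment step: each point goes to its nearest centroid (Euclidean distance)
  d <- sapply(1:nrow(centroids), function(i)
    sqrt(rowSums((pts - matrix(centroids[i, ], nrow(pts), 2, byrow = TRUE))^2)))
  cluster <- apply(d, 1, which.min)

  # Update step: each centroid becomes the mean of its cluster's points
  new_centroids <- apply(pts, 2, function(col) tapply(col, cluster, mean))

  # Stop once the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-9)) break
  centroids <- new_centroids
}
cluster  # final cluster assignments
```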

Unsupervised Learning

  • Statistical modeling technique used to categorize/group data without the assistance of labeled outcomes or human intervention
  • Autonomous data categorization
  • Creates a data driven outcome

Clustering

  • Clustering is the act of partitioning data into meaningful groups based on similarity of attributes

  • The goal of clustering is to create insightful clusters to better understand connections in the data

Clustering

  • \(k=4\), the seed points are green, forming 4 respective clusters
  • Once the initial assignment of centroids is made, the Euclidean distance is used to establish cluster members by minimizing distance (red lines in the figure)

Figure: Centroids

Euclidean Distance

  • Euclidean distance is used to measure dissimilarity between data points in order to group similar data instances into clusters
  • In k-means, this is done by calculating the distance between a data point and a cluster center using the following formula:

\[ d(x,C_i)=\sqrt{\sum_{j=1}^{N} (x_j-C_{ij})^2} \]
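In R, the formula above is a one-liner (the helper name `euclid` and the example vectors are hypothetical):

```{r}
# Euclidean distance between a data point x and a cluster center
euclid <- function(x, center) sqrt(sum((x - center)^2))

euclid(c(0, 0), c(3, 4))  # classic 3-4-5 triangle: distance 5
```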

Objective Function:

  • The k-means clustering objective is to minimize the within-cluster sum of squares (variance)

It is formulated as:

\[ \mathrm{WCSS} = \sum_{i=1}^{k}\sum_{x \in C_i} \lVert x-\mu_i \rVert^2 \]

Euclidean Distance

  • \(k\) is the number of clusters

  • \(C_i\) represents the set of points in cluster \(i\)

  • \(\mu_i\) represents the mean (centroid) of cluster \(i\)

  • In this context, similarity is inversely related to the Euclidean distance

  • The smaller the distance, the greater the similarity between objects
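As an illustration, the objective can be evaluated by hand for a toy assignment (made-up points; `kmeans()` reports this same quantity as `tot.withinss`):

```{r}
# Made-up 2-D points and a hypothetical cluster assignment
pts <- matrix(c(1, 1,  1.5, 2,  8, 8,  9, 9), ncol = 2, byrow = TRUE)
cluster <- c(1, 1, 2, 2)

# Within-cluster sum of squares: squared distances to each cluster's mean
wcss <- sum(sapply(unique(cluster), function(i) {
  members <- pts[cluster == i, , drop = FALSE]
  mu <- colMeans(members)  # centroid of cluster i
  sum(rowSums((members - matrix(mu, nrow(members), 2, byrow = TRUE))^2))
}))
wcss  # 1.625 for these toy values
```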

New Centroids

  • K-means clustering reassigns the data points to each cluster based on the Euclidean Distance calculation

  • A new centroid location is set by updating the position of each cluster's mean center

Figure: Data points reassigned to new centroids

Determining K

  • The elbow method plots the within-cluster sum of squares (WCSS) against \(k\)
```{r}
set.seed(100)  # reproducible centroid initialization
# Create an empty vector to store WCSS values
wcss <- vector("numeric", length = 10)
# Iterate over a range of K values (from 1 to 10)
for (i in 1:10) {
  model <- kmeans(df_norm[c("Sales", "Orders", "AvgSale")], centers = i, nstart = 10)
  wcss[i] <- model$tot.withinss
}
```

Elbow Method

```{r}
library(ggplot2)
# Plot the WCSS values against the number of clusters
p1 <- ggplot(data.frame(K = 1:10, WCSS = wcss), aes(x = K, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Method to Find Optimal K",
       x = "Number of Clusters (K)",
       y = "Within-Cluster Sum of Squares (WCSS)") +
  scale_x_continuous(breaks = seq(0, 10, by = 1))
p1
```

Clustering Model & Results:

```{r}
library(broom)
set.seed(100)
model1 <- kmeans(df_norm[c("Sales", "Orders", "AvgSale")], centers = 4)
tidy(model1)
```
# A tibble: 4 x 6
   Sales Orders AvgSale  size withinss cluster
   <dbl>  <dbl>   <dbl> <int>    <dbl> <fct>  
1 -0.684 -0.964   0.155    75     42.2 1      
2  0.536 -0.500   1.21     73     55.8 2      
3  0.969  1.07    0.311   145    148.  3      
4 -1.03  -0.374  -1.16    125    120.  4      
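For interpretation, the fitted labels can be joined back to the customers and profiled. A sketch using a stand-in `df_norm` (random toy data, not the actual normalized retail features):

```{r}
set.seed(100)
# Stand-in for the normalized customer features (20 made-up rows)
df_norm <- data.frame(Sales = rnorm(20), Orders = rnorm(20), AvgSale = rnorm(20))
model1 <- kmeans(df_norm[c("Sales", "Orders", "AvgSale")], centers = 4, nstart = 10)

# Attach each customer's cluster label, then profile the clusters by mean
df_norm$cluster <- model1$cluster
aggregate(cbind(Sales, Orders, AvgSale) ~ cluster, data = df_norm, FUN = mean)
```

The per-cluster means produced this way are what the characteristics below are read from.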

Cluster Interpretation

Cluster 1:

  • Characteristics: Lower total sales, low orders, average per-order sales
  • Recommendation: Targeted marketing for newer customers

Cluster 2:

  • Characteristics: Average total sales, below-average orders, very high per-order sales
  • Recommendation: Marketing high-value items to infrequent buyers

Cluster Interpretation

Cluster 3:

  • Characteristics: Higher total sales, higher orders, average per-order value
  • Recommendation: Recommend low to mid-priced items for frequent buyers

Cluster 4:

  • Characteristics: Lowest total sales, low orders, low per-order sales
  • Recommendation: Minimal marketing efforts due to low ROI

Future Works

Challenges and Considerations -

  • Data Handling:

    Managing large and noisy datasets

  • Robustness:

    Ensuring robustness against outliers

  • Cluster Number Determination:

    Defining an appropriate number of clusters

Future Works

Research Focus -

  • Continued Exploration:

    Ongoing refinement of clustering techniques and cluster selection process

  • Industry Evolution:

    Adapting newer methods to meet evolving e-commerce demands